Problem statement¶

Problem Statement - Part I This assignment contains two parts. Part-I is a programming assignment (to be submitted in a Jupyter Notebook), and Part-II includes subjective questions (to be submitted in a PDF file).

Part-II is given on the next page.

Assignment Part-I A US-based housing company named Surprise Housing has decided to enter the Australian market. The company uses data analytics to purchase houses at a price below their actual values and flip them on at a higher price. For the same purpose, the company has collected a data set from the sale of houses in Australia. The data is provided in the CSV file below.

The company is looking at prospective properties to buy to enter the market. You are required to build a regression model using regularisation in order to predict the actual value of the prospective properties and decide whether to invest in them or not.

The company wants to know:

Which variables are significant in predicting the price of a house, and

How well those variables describe the price of a house.

Also, determine the optimal value of lambda for ridge and lasso regression.

Business Goal

You are required to model the price of houses with the available independent variables. This model will then be used by the management to understand how exactly the prices vary with the variables. They can accordingly manipulate the strategy of the firm and concentrate on areas that will yield high returns. Further, the model will be a good way for management to understand the pricing dynamics of a new market.

In [1]:
######## Importing necessary libraries and reading data </h2>
In [2]:
import pandas as pd #Data Processing
import numpy as np #Linear Algebra
import seaborn as sns #Data Visualization
import matplotlib.pyplot as plt #Data Visualization
In [3]:
import warnings #Warnings
warnings.filterwarnings ("ignore") #Warnings
pd.set_option('display.max_rows', None)# to display all the rows
#pd.options.display.float_format = '{:.2f}'.format
In [5]:
df_house = pd.read_csv("train.csv")
df_house.head() 
Out[5]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [6]:
df_house.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
In [7]:
df_house.shape
Out[7]:
(1460, 81)

EDA¶

In [8]:
print("duplicate rows : ",df_house.duplicated().sum())
print("columns in which all values are null",df_house.isnull().all(axis=1).sum())
print("rows in which all values are null",df_house.isnull().all(axis=0).sum())
duplicate rows :  0
columns in which all values are null 0
rows in which all values are null 0
In [9]:
100*df_house.isnull().mean().sort_values(ascending=False)#rows in which most of the values are null
Out[9]:
PoolQC           99.520548
MiscFeature      96.301370
Alley            93.767123
Fence            80.753425
MasVnrType       59.726027
FireplaceQu      47.260274
LotFrontage      17.739726
GarageYrBlt       5.547945
GarageCond        5.547945
GarageType        5.547945
GarageFinish      5.547945
GarageQual        5.547945
BsmtFinType2      2.602740
BsmtExposure      2.602740
BsmtQual          2.534247
BsmtCond          2.534247
BsmtFinType1      2.534247
MasVnrArea        0.547945
Electrical        0.068493
Id                0.000000
Functional        0.000000
Fireplaces        0.000000
KitchenQual       0.000000
KitchenAbvGr      0.000000
BedroomAbvGr      0.000000
HalfBath          0.000000
FullBath          0.000000
BsmtHalfBath      0.000000
TotRmsAbvGrd      0.000000
GarageCars        0.000000
GrLivArea         0.000000
GarageArea        0.000000
PavedDrive        0.000000
WoodDeckSF        0.000000
OpenPorchSF       0.000000
EnclosedPorch     0.000000
3SsnPorch         0.000000
ScreenPorch       0.000000
PoolArea          0.000000
MiscVal           0.000000
MoSold            0.000000
YrSold            0.000000
SaleType          0.000000
SaleCondition     0.000000
BsmtFullBath      0.000000
HeatingQC         0.000000
LowQualFinSF      0.000000
LandSlope         0.000000
OverallQual       0.000000
HouseStyle        0.000000
BldgType          0.000000
Condition2        0.000000
Condition1        0.000000
Neighborhood      0.000000
LotConfig         0.000000
YearBuilt         0.000000
Utilities         0.000000
LandContour       0.000000
LotShape          0.000000
Street            0.000000
LotArea           0.000000
MSZoning          0.000000
OverallCond       0.000000
YearRemodAdd      0.000000
2ndFlrSF          0.000000
BsmtFinSF2        0.000000
1stFlrSF          0.000000
CentralAir        0.000000
MSSubClass        0.000000
Heating           0.000000
TotalBsmtSF       0.000000
BsmtUnfSF         0.000000
BsmtFinSF1        0.000000
RoofStyle         0.000000
Foundation        0.000000
ExterCond         0.000000
ExterQual         0.000000
Exterior2nd       0.000000
Exterior1st       0.000000
RoofMatl          0.000000
SalePrice         0.000000
dtype: float64
Removing columns which have more than 30 % missing values as they are not beneficial, because even if we impute them, most of the values remain same and which will not help in our analysis.¶
In [10]:
df_house.shape
Out[10]:
(1460, 81)
In [11]:
rm_cols = df_house.columns[df_house.isnull().mean() * 100 >= 30.00 ]
df_house.drop(rm_cols, axis=1, inplace=True)
In [12]:
df_house.shape
Out[12]:
(1460, 75)
In [13]:
#Validating the null values again
100*df_house.isnull().mean().sort_values(ascending=False)
Out[13]:
LotFrontage      17.739726
GarageYrBlt       5.547945
GarageCond        5.547945
GarageType        5.547945
GarageFinish      5.547945
GarageQual        5.547945
BsmtFinType2      2.602740
BsmtExposure      2.602740
BsmtFinType1      2.534247
BsmtCond          2.534247
BsmtQual          2.534247
MasVnrArea        0.547945
Electrical        0.068493
WoodDeckSF        0.000000
PavedDrive        0.000000
LowQualFinSF      0.000000
GrLivArea         0.000000
BsmtFullBath      0.000000
BsmtHalfBath      0.000000
FullBath          0.000000
HalfBath          0.000000
SaleCondition     0.000000
BedroomAbvGr      0.000000
SaleType          0.000000
YrSold            0.000000
MoSold            0.000000
MiscVal           0.000000
KitchenAbvGr      0.000000
KitchenQual       0.000000
TotRmsAbvGrd      0.000000
PoolArea          0.000000
Functional        0.000000
Fireplaces        0.000000
ScreenPorch       0.000000
2ndFlrSF          0.000000
3SsnPorch         0.000000
GarageCars        0.000000
GarageArea        0.000000
EnclosedPorch     0.000000
OpenPorchSF       0.000000
Id                0.000000
Heating           0.000000
1stFlrSF          0.000000
OverallCond       0.000000
MSZoning          0.000000
LotArea           0.000000
Street            0.000000
LotShape          0.000000
LandContour       0.000000
Utilities         0.000000
LotConfig         0.000000
LandSlope         0.000000
Neighborhood      0.000000
Condition1        0.000000
Condition2        0.000000
BldgType          0.000000
HouseStyle        0.000000
OverallQual       0.000000
YearBuilt         0.000000
CentralAir        0.000000
YearRemodAdd      0.000000
RoofStyle         0.000000
RoofMatl          0.000000
Exterior1st       0.000000
Exterior2nd       0.000000
ExterQual         0.000000
ExterCond         0.000000
Foundation        0.000000
BsmtFinSF1        0.000000
BsmtFinSF2        0.000000
BsmtUnfSF         0.000000
TotalBsmtSF       0.000000
MSSubClass        0.000000
HeatingQC         0.000000
SalePrice         0.000000
dtype: float64

Replace the missing values for numarical columns¶

In [14]:
null_cols = df_house.columns[df_house.isnull().any()]
df_house[null_cols].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   LotFrontage   1201 non-null   float64
 1   MasVnrArea    1452 non-null   float64
 2   BsmtQual      1423 non-null   object 
 3   BsmtCond      1423 non-null   object 
 4   BsmtExposure  1422 non-null   object 
 5   BsmtFinType1  1423 non-null   object 
 6   BsmtFinType2  1422 non-null   object 
 7   Electrical    1459 non-null   object 
 8   GarageType    1379 non-null   object 
 9   GarageYrBlt   1379 non-null   float64
 10  GarageFinish  1379 non-null   object 
 11  GarageQual    1379 non-null   object 
 12  GarageCond    1379 non-null   object 
dtypes: float64(3), object(10)
memory usage: 148.4+ KB
In [15]:
df_house['LotFrontage'].describe(percentiles=[.25, .5, .75, .90, .95, .99])
Out[15]:
count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
90%        96.000000
95%       107.000000
99%       141.000000
max       313.000000
Name: LotFrontage, dtype: float64
In [16]:
print(df_house['MasVnrArea'].describe(percentiles=[.25, .5, .75, .90, .95, .99]))
print(df_house['MasVnrArea'].median())

print(df_house['GarageYrBlt'].describe(percentiles=[.25, .5, .75, .90, .95, .99]))
print(df_house['GarageYrBlt'].median())
count    1452.000000
mean      103.685262
std       181.066207
min         0.000000
25%         0.000000
50%         0.000000
75%       166.000000
90%       335.000000
95%       456.000000
99%       791.920000
max      1600.000000
Name: MasVnrArea, dtype: float64
0.0
count    1379.000000
mean     1978.506164
std        24.689725
min      1900.000000
25%      1961.000000
50%      1980.000000
75%      2002.000000
90%      2006.000000
95%      2007.000000
99%      2009.000000
max      2010.000000
Name: GarageYrBlt, dtype: float64
1980.0
In [17]:
df_house[["LotFrontage", "MasVnrArea",'GarageYrBlt']].plot(kind= "box", subplots= True)
plt.show()
df_house[["LotFrontage", "MasVnrArea",'GarageYrBlt']].median()
No description has been provided for this image
Out[17]:
LotFrontage      69.0
MasVnrArea        0.0
GarageYrBlt    1980.0
dtype: float64
In [18]:
# We will replace the null with median for LotFrontage MasVnrArea and GarageYrBlt columns
df_house["LotFrontage"].fillna(df_house["LotFrontage"].median(), inplace=True)
df_house["MasVnrArea"].fillna(df_house["MasVnrArea"].mean(), inplace=True)
df_house["GarageYrBlt"].fillna(df_house["GarageYrBlt"].median(), inplace=True)

Replaceing categorical columns¶

In [19]:
# Electrical system type  Filling the Electrical with the mode
df_house['Electrical'] = df_house['Electrical'].fillna(df_house['Electrical'].mode()[0])
In [20]:
#Validating the null values again
df_house.columns[df_house.isnull().any()]
Out[20]:
Index(['BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'],
      dtype='object')
In [21]:
#As mentioned in the Data Dictionary NA value means it is not present and thus we can replace it with none
null_with_meaning = ["BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", "GarageType", "GarageFinish", "GarageQual", "GarageCond"]
for i in null_with_meaning:
    df_house[i].fillna("none", inplace=True)
In [22]:
null_cols = df_house.columns[df_house.isnull().any()]
df_house[null_cols].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Empty DataFrame
In [23]:
df_house.drop('Id',axis=1,inplace=True)
In [24]:
#df_house = df_house.round(decimals = 2)
df_house.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 74 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1460 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   LotShape       1460 non-null   object 
 6   LandContour    1460 non-null   object 
 7   Utilities      1460 non-null   object 
 8   LotConfig      1460 non-null   object 
 9   LandSlope      1460 non-null   object 
 10  Neighborhood   1460 non-null   object 
 11  Condition1     1460 non-null   object 
 12  Condition2     1460 non-null   object 
 13  BldgType       1460 non-null   object 
 14  HouseStyle     1460 non-null   object 
 15  OverallQual    1460 non-null   int64  
 16  OverallCond    1460 non-null   int64  
 17  YearBuilt      1460 non-null   int64  
 18  YearRemodAdd   1460 non-null   int64  
 19  RoofStyle      1460 non-null   object 
 20  RoofMatl       1460 non-null   object 
 21  Exterior1st    1460 non-null   object 
 22  Exterior2nd    1460 non-null   object 
 23  MasVnrArea     1460 non-null   float64
 24  ExterQual      1460 non-null   object 
 25  ExterCond      1460 non-null   object 
 26  Foundation     1460 non-null   object 
 27  BsmtQual       1460 non-null   object 
 28  BsmtCond       1460 non-null   object 
 29  BsmtExposure   1460 non-null   object 
 30  BsmtFinType1   1460 non-null   object 
 31  BsmtFinSF1     1460 non-null   int64  
 32  BsmtFinType2   1460 non-null   object 
 33  BsmtFinSF2     1460 non-null   int64  
 34  BsmtUnfSF      1460 non-null   int64  
 35  TotalBsmtSF    1460 non-null   int64  
 36  Heating        1460 non-null   object 
 37  HeatingQC      1460 non-null   object 
 38  CentralAir     1460 non-null   object 
 39  Electrical     1460 non-null   object 
 40  1stFlrSF       1460 non-null   int64  
 41  2ndFlrSF       1460 non-null   int64  
 42  LowQualFinSF   1460 non-null   int64  
 43  GrLivArea      1460 non-null   int64  
 44  BsmtFullBath   1460 non-null   int64  
 45  BsmtHalfBath   1460 non-null   int64  
 46  FullBath       1460 non-null   int64  
 47  HalfBath       1460 non-null   int64  
 48  BedroomAbvGr   1460 non-null   int64  
 49  KitchenAbvGr   1460 non-null   int64  
 50  KitchenQual    1460 non-null   object 
 51  TotRmsAbvGrd   1460 non-null   int64  
 52  Functional     1460 non-null   object 
 53  Fireplaces     1460 non-null   int64  
 54  GarageType     1460 non-null   object 
 55  GarageYrBlt    1460 non-null   float64
 56  GarageFinish   1460 non-null   object 
 57  GarageCars     1460 non-null   int64  
 58  GarageArea     1460 non-null   int64  
 59  GarageQual     1460 non-null   object 
 60  GarageCond     1460 non-null   object 
 61  PavedDrive     1460 non-null   object 
 62  WoodDeckSF     1460 non-null   int64  
 63  OpenPorchSF    1460 non-null   int64  
 64  EnclosedPorch  1460 non-null   int64  
 65  3SsnPorch      1460 non-null   int64  
 66  ScreenPorch    1460 non-null   int64  
 67  PoolArea       1460 non-null   int64  
 68  MiscVal        1460 non-null   int64  
 69  MoSold         1460 non-null   int64  
 70  YrSold         1460 non-null   int64  
 71  SaleType       1460 non-null   object 
 72  SaleCondition  1460 non-null   object 
 73  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(34), object(37)
memory usage: 844.2+ KB

Deriving variables¶

In [25]:
#Overall area for all floors and basement plays an important role, hence creating total area in square foot column
df_house['Total_sqr_footage'] = (df_house['BsmtFinSF1'] + df_house['BsmtFinSF2'] + df_house['1stFlrSF'] + df_house['2ndFlrSF']) + df_house['GrLivArea']
# Creating derived column for total number of bathrooms column
df_house['Total_Bathrooms'] = (df_house['FullBath'] + (0.5 * df_house['HalfBath']) + df_house['BsmtFullBath'] + (0.5 * df_house['BsmtHalfBath']))
#Creating derived column for total porch area 
df_house['Total_porch_sf'] = (df_house['OpenPorchSF'] + df_house['3SsnPorch'] + df_house['EnclosedPorch'] + df_house['ScreenPorch'] + df_house['WoodDeckSF'])
In [26]:
#Lets drop these extra columns :
extraCols = ['BsmtFinSF1','BsmtFinSF2','1stFlrSF','2ndFlrSF','GrLivArea','FullBath','HalfBath','BsmtFullBath','BsmtHalfBath','OpenPorchSF','3SsnPorch','EnclosedPorch','ScreenPorch','WoodDeckSF']
df_house.drop(extraCols,axis=1,inplace=True)
df_house.shape# verifying the shape of the dataset
Out[26]:
(1460, 63)
In [27]:
# Creating a new Column to determine the age of the property
df_house['Total_Age']=df_house['YrSold']-df_house['YearBuilt']
df_house['Garage_age'] = df_house['YrSold'] - df_house['GarageYrBlt']
df_house['Remodel_age'] = df_house['YrSold'] - df_house['YearRemodAdd']
#Also lets drop out variables like GarageYrBlt and YearRemodAdd as we are already calculating the number of years
drop_cols = ['GarageYrBlt','YearRemodAdd','YearBuilt']
df_house.drop(labels = drop_cols, axis = 1, inplace=True) #Dropping the columns added in the list
print("The new size of the data is" , df_house.shape) #Printing the new Dataset Shape
The new size of the data is (1460, 63)

Correlation of Numerical columns¶

In [28]:
# Checking the corelation
numeric_columns = df_house.select_dtypes(include=[np.number])
plt.subplots(figsize = (25,20))
#Plotting heatmap of numerical features
sns.heatmap(round(numeric_columns.corr(),2), cmap='coolwarm' , annot=True, center = 0)
plt.show()
No description has been provided for this image
In [29]:
numeric_columns = df_house.select_dtypes(include=[np.number])
important_num_cols = list(numeric_columns.corr()["SalePrice"][(numeric_columns.corr()["SalePrice"]>0.50) | (numeric_columns.corr()["SalePrice"]<-0.50)].index)
important_num_cols #columns highly correlated with SalePRice
Out[29]:
['OverallQual',
 'TotalBsmtSF',
 'TotRmsAbvGrd',
 'GarageCars',
 'GarageArea',
 'SalePrice',
 'Total_sqr_footage',
 'Total_Bathrooms',
 'Total_Age',
 'Remodel_age']
In [30]:
plt.figure(figsize = (70, 90))
sns.pairplot(df_house, vars= important_num_cols)
plt.show()
<Figure size 7000x9000 with 0 Axes>
No description has been provided for this image

Outlier Analysis¶

In [31]:
#Lets divide the Columns based on Numerical/continous and categorical
Cat_cols = []
Num_cols = []
for i in df_house.columns :
    if df_house[i].dtype == "object":
        Cat_cols.append(i)
    else:
        Num_cols.append(i)
cat_info_df = pd.DataFrame({
    'Categorical Column': Cat_cols,
    'Info': [df_house[col].dtypes for col in Cat_cols],
    'Num Unique': [df_house[col].nunique() for col in Cat_cols]
})
print(cat_info_df)
num_info_df = pd.DataFrame({
    'Numerical Column': Num_cols,
    'Info': [df_house[col].dtypes for col in Num_cols],
    'Num Unique': [df_house[col].nunique() for col in Num_cols]
})
print(num_info_df)
   Categorical Column    Info  Num Unique
0            MSZoning  object           5
1              Street  object           2
2            LotShape  object           4
3         LandContour  object           4
4           Utilities  object           2
5           LotConfig  object           5
6           LandSlope  object           3
7        Neighborhood  object          25
8          Condition1  object           9
9          Condition2  object           8
10           BldgType  object           5
11         HouseStyle  object           8
12          RoofStyle  object           6
13           RoofMatl  object           8
14        Exterior1st  object          15
15        Exterior2nd  object          16
16          ExterQual  object           4
17          ExterCond  object           5
18         Foundation  object           6
19           BsmtQual  object           5
20           BsmtCond  object           5
21       BsmtExposure  object           5
22       BsmtFinType1  object           7
23       BsmtFinType2  object           7
24            Heating  object           6
25          HeatingQC  object           5
26         CentralAir  object           2
27         Electrical  object           5
28        KitchenQual  object           4
29         Functional  object           7
30         GarageType  object           7
31       GarageFinish  object           4
32         GarageQual  object           6
33         GarageCond  object           6
34         PavedDrive  object           3
35           SaleType  object           9
36      SaleCondition  object           6
     Numerical Column     Info  Num Unique
0          MSSubClass    int64          15
1         LotFrontage  float64         110
2             LotArea    int64        1073
3         OverallQual    int64          10
4         OverallCond    int64           9
5          MasVnrArea  float64         328
6           BsmtUnfSF    int64         780
7         TotalBsmtSF    int64         721
8        LowQualFinSF    int64          24
9        BedroomAbvGr    int64           8
10       KitchenAbvGr    int64           4
11       TotRmsAbvGrd    int64          12
12         Fireplaces    int64           4
13         GarageCars    int64           5
14         GarageArea    int64         441
15           PoolArea    int64           8
16            MiscVal    int64          21
17             MoSold    int64          12
18             YrSold    int64           5
19          SalePrice    int64         663
20  Total_sqr_footage    int64        1124
21    Total_Bathrooms  float64          10
22     Total_porch_sf    int64         427
23          Total_Age    int64         122
24         Garage_age  float64         101
25        Remodel_age    int64          62
In [32]:
#Lets plot SalePrice against all the categorical columns
plt.figure(figsize=(30,80))#The size of the plot
c=0
num_cols = 3
num_rows = (len(Cat_cols) - 1) // num_cols + 1  # Calculate the number of rows needed
for i in Cat_cols:
    c=c+1
    plt.subplot(num_rows,num_cols,c)
    sns.boxplot(x = 'SalePrice', y = df_house[str(i)], data = df_house)
    plt.title(str(i)+" VS SalePrice - Box Plot\n",fontsize=20)#The title of the plot    
plt.tight_layout()#to avoid overlapping layout
plt.show()#to display the plot
No description has been provided for this image
In [33]:
#As we can observe there are outlier in many columns, listing them below
outlier = ['LotFrontage','LotArea','Total_sqr_footage','Total_porch_sf']
for i in outlier:
    qnt = df_house[i].quantile(0.98)#removing data above 98 percentile
    df_house = df_house[df_house[i] < qnt]
In [34]:
df_house.shape
Out[34]:
(1343, 63)

Data Preparation¶

Creating dummy columns for categorical columns¶

In [35]:
df_house = pd.get_dummies(df_house,drop_first=True)
df_house.info()#displaying the updated Datatypes
<class 'pandas.core.frame.DataFrame'>
Index: 1343 entries, 0 to 1458
Columns: 225 entries, MSSubClass to SaleCondition_Partial
dtypes: bool(199), float64(4), int64(22)
memory usage: 544.3 KB
In [36]:
df_house.head()
Out[36]:
MSSubClass LotFrontage LotArea OverallQual OverallCond MasVnrArea BsmtUnfSF TotalBsmtSF LowQualFinSF BedroomAbvGr ... SaleType_ConLI SaleType_ConLw SaleType_New SaleType_Oth SaleType_WD SaleCondition_AdjLand SaleCondition_Alloca SaleCondition_Family SaleCondition_Normal SaleCondition_Partial
0 60 65.0 8450 7 5 196.0 150 856 0 3 ... False False False False True False False False True False
1 20 80.0 9600 6 8 0.0 284 1262 0 3 ... False False False False True False False False True False
2 60 68.0 11250 7 5 162.0 434 920 0 3 ... False False False False True False False False True False
3 70 60.0 9550 7 5 0.0 540 756 0 3 ... False False False False True False False False False False
4 60 84.0 14260 8 5 350.0 490 1145 0 4 ... False False False False True False False False True False

5 rows × 225 columns

In [37]:
bool_columns = df_house.select_dtypes(include=['bool'])
def map_bool(col):
    return col.map({True:1,False:0})
df_house[bool_columns.columns] =  df_house[bool_columns.columns].apply(map_bool)
In [38]:
df_house.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1343 entries, 0 to 1458
Columns: 225 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(4), int64(221)
memory usage: 2.3 MB
In [39]:
plt.figure(figsize=(16,6))
sns.distplot(df_house.SalePrice)
plt.show()
No description has been provided for this image
In [40]:
df_house.shape
Out[40]:
(1343, 225)

Dividing the Data in terms of TRAIN and TEST.¶

In [41]:
# Using Sklearn and stats model for modeling
In [42]:
from sklearn.model_selection import train_test_split #for spliting the data in terms of train and test
from sklearn.preprocessing import MinMaxScaler #for performing minmax scaling on the continous variables of training data
from sklearn.feature_selection import RFE #for performing automated Feature Selection
from sklearn.linear_model import LinearRegression #to build linear model
from sklearn.linear_model import Ridge #for ridge regularization
from sklearn.linear_model import Lasso #for lasso regularization
from sklearn.model_selection import GridSearchCV #finding the optimal parameter values
from sklearn.metrics import r2_score #for calculating the r-square value
import statsmodels.api as sm #for add the constant value
from sklearn import metrics
from statsmodels.stats.outliers_influence import variance_inflation_factor #to calculate the VIF
from sklearn.metrics import mean_squared_error #for calculating the mean squared error
In [43]:
#Here we will keep the train size as 70% and automatically test size will be rest 30%
#Also we will keep random_state as a fixed value of 100 so that the data set does no changes
df_train,df_test = train_test_split(df_house, train_size = 0.7, random_state = 100)
print ("The Size of Train data is",df_train.shape)
print ("The Size of Test data is",df_test.shape)
The Size of Train data is (940, 225)
The Size of Test data is (403, 225)

Scaling¶

In [44]:
Scaler = MinMaxScaler() # Instantiate an objectr
#Note-The above order of columns in num_cols should be same in df_test, otherwise we will get a wrong r-square value
df_train[Num_cols] = Scaler.fit_transform(df_train[Num_cols])
df_test[Num_cols] = Scaler.transform(df_test[Num_cols])
In [45]:
#Define X_train and y_train
y_train = df_train.pop('SalePrice') #This contains only the Target Variable
X_train = df_train #This contains all the Independent Variables except the Target Variable
#Since 'SalePrice' is the target variable we will keep it only on y-train and remove it from X_train
In [46]:
y_test = df_test.pop('SalePrice')
X_test = df_test

Model :1 Automated Process using RFE and VIF¶

In [47]:
X_train.shape
Out[47]:
(940, 224)
In [48]:
#Fit the Model
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=100)
rfe = rfe.fit(X_train,y_train)
#View the support_ and rank_ 
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
Out[48]:
[('MSSubClass', False, 20),
 ('LotFrontage', False, 97),
 ('LotArea', True, 1),
 ('OverallQual', True, 1),
 ('OverallCond', True, 1),
 ('MasVnrArea', False, 48),
 ('BsmtUnfSF', True, 1),
 ('TotalBsmtSF', False, 8),
 ('LowQualFinSF', False, 55),
 ('BedroomAbvGr', True, 1),
 ('KitchenAbvGr', True, 1),
 ('TotRmsAbvGrd', True, 1),
 ('Fireplaces', False, 32),
 ('GarageCars', False, 119),
 ('GarageArea', True, 1),
 ('PoolArea', True, 1),
 ('MiscVal', True, 1),
 ('MoSold', False, 106),
 ('YrSold', False, 100),
 ('Total_sqr_footage', True, 1),
 ('Total_Bathrooms', True, 1),
 ('Total_porch_sf', False, 25),
 ('Total_Age', True, 1),
 ('Garage_age', False, 58),
 ('Remodel_age', False, 61),
 ('MSZoning_FV', True, 1),
 ('MSZoning_RH', False, 4),
 ('MSZoning_RL', True, 1),
 ('MSZoning_RM', False, 6),
 ('Street_Pave', True, 1),
 ('LotShape_IR2', False, 120),
 ('LotShape_IR3', False, 15),
 ('LotShape_Reg', False, 77),
 ('LandContour_HLS', False, 84),
 ('LandContour_Low', False, 18),
 ('LandContour_Lvl', False, 46),
 ('Utilities_NoSeWa', True, 1),
 ('LotConfig_CulDSac', False, 75),
 ('LotConfig_FR2', False, 35),
 ('LotConfig_FR3', False, 112),
 ('LotConfig_Inside', False, 114),
 ('LandSlope_Mod', False, 111),
 ('LandSlope_Sev', True, 1),
 ('Neighborhood_Blueste', False, 66),
 ('Neighborhood_BrDale', False, 90),
 ('Neighborhood_BrkSide', False, 122),
 ('Neighborhood_ClearCr', False, 52),
 ('Neighborhood_CollgCr', False, 22),
 ('Neighborhood_Crawfor', True, 1),
 ('Neighborhood_Edwards', False, 9),
 ('Neighborhood_Gilbert', False, 12),
 ('Neighborhood_IDOTRR', False, 73),
 ('Neighborhood_MeadowV', True, 1),
 ('Neighborhood_Mitchel', True, 1),
 ('Neighborhood_NAmes', False, 10),
 ('Neighborhood_NPkVill', False, 38),
 ('Neighborhood_NWAmes', False, 7),
 ('Neighborhood_NoRidge', True, 1),
 ('Neighborhood_NridgHt', True, 1),
 ('Neighborhood_OldTown', False, 2),
 ('Neighborhood_SWISU', False, 24),
 ('Neighborhood_Sawyer', False, 21),
 ('Neighborhood_SawyerW', False, 23),
 ('Neighborhood_Somerst', True, 1),
 ('Neighborhood_StoneBr', True, 1),
 ('Neighborhood_Timber', False, 13),
 ('Neighborhood_Veenker', False, 91),
 ('Condition1_Feedr', True, 1),
 ('Condition1_Norm', True, 1),
 ('Condition1_PosA', False, 36),
 ('Condition1_PosN', True, 1),
 ('Condition1_RRAe', True, 1),
 ('Condition1_RRAn', True, 1),
 ('Condition1_RRNe', True, 1),
 ('Condition1_RRNn', False, 76),
 ('Condition2_Feedr', False, 47),
 ('Condition2_Norm', False, 56),
 ('Condition2_PosN', True, 1),
 ('Condition2_RRAe', True, 1),
 ('Condition2_RRAn', False, 27),
 ('Condition2_RRNn', False, 62),
 ('BldgType_2fmCon', False, 83),
 ('BldgType_Duplex', True, 1),
 ('BldgType_Twnhs', True, 1),
 ('BldgType_TwnhsE', True, 1),
 ('HouseStyle_1.5Unf', False, 57),
 ('HouseStyle_1Story', False, 89),
 ('HouseStyle_2.5Fin', True, 1),
 ('HouseStyle_2.5Unf', False, 124),
 ('HouseStyle_2Story', False, 64),
 ('HouseStyle_SFoyer', False, 121),
 ('HouseStyle_SLvl', False, 117),
 ('RoofStyle_Gable', False, 50),
 ('RoofStyle_Gambrel', False, 51),
 ('RoofStyle_Hip', False, 49),
 ('RoofStyle_Mansard', True, 1),
 ('RoofStyle_Shed', True, 1),
 ('RoofMatl_Metal', True, 1),
 ('RoofMatl_Roll', False, 70),
 ('RoofMatl_Tar&Grv', True, 1),
 ('RoofMatl_WdShake', False, 85),
 ('RoofMatl_WdShngl', True, 1),
 ('Exterior1st_AsphShn', True, 1),
 ('Exterior1st_BrkComm', False, 5),
 ('Exterior1st_BrkFace', False, 74),
 ('Exterior1st_CBlock', True, 1),
 ('Exterior1st_CemntBd', True, 1),
 ('Exterior1st_HdBoard', False, 31),
 ('Exterior1st_ImStucc', False, 14),
 ('Exterior1st_MetalSd', False, 72),
 ('Exterior1st_Plywood', False, 28),
 ('Exterior1st_Stone', True, 1),
 ('Exterior1st_Stucco', False, 42),
 ('Exterior1st_VinylSd', False, 71),
 ('Exterior1st_Wd Sdng', True, 1),
 ('Exterior1st_WdShing', False, 26),
 ('Exterior2nd_AsphShn', True, 1),
 ('Exterior2nd_Brk Cmn', False, 53),
 ('Exterior2nd_BrkFace', True, 1),
 ('Exterior2nd_CBlock', True, 1),
 ('Exterior2nd_CmentBd', True, 1),
 ('Exterior2nd_HdBoard', False, 109),
 ('Exterior2nd_ImStucc', False, 67),
 ('Exterior2nd_MetalSd', False, 110),
 ('Exterior2nd_Other', True, 1),
 ('Exterior2nd_Plywood', False, 86),
 ('Exterior2nd_Stone', False, 40),
 ('Exterior2nd_Stucco', False, 33),
 ('Exterior2nd_VinylSd', False, 108),
 ('Exterior2nd_Wd Sdng', True, 1),
 ('Exterior2nd_Wd Shng', False, 113),
 ('ExterQual_Fa', False, 88),
 ('ExterQual_Gd', True, 1),
 ('ExterQual_TA', True, 1),
 ('ExterCond_Fa', False, 17),
 ('ExterCond_Gd', False, 16),
 ('ExterCond_Po', True, 1),
 ('ExterCond_TA', False, 19),
 ('Foundation_CBlock', False, 79),
 ('Foundation_PConc', False, 78),
 ('Foundation_Slab', False, 87),
 ('Foundation_Stone', False, 65),
 ('Foundation_Wood', True, 1),
 ('BsmtQual_Fa', True, 1),
 ('BsmtQual_Gd', True, 1),
 ('BsmtQual_TA', True, 1),
 ('BsmtQual_none', True, 1),
 ('BsmtCond_Gd', False, 60),
 ('BsmtCond_Po', True, 1),
 ('BsmtCond_TA', False, 59),
 ('BsmtCond_none', True, 1),
 ('BsmtExposure_Gd', True, 1),
 ('BsmtExposure_Mn', False, 96),
 ('BsmtExposure_No', False, 95),
 ('BsmtExposure_none', True, 1),
 ('BsmtFinType1_BLQ', False, 103),
 ('BsmtFinType1_GLQ', False, 39),
 ('BsmtFinType1_LwQ', False, 81),
 ('BsmtFinType1_Rec', False, 82),
 ('BsmtFinType1_Unf', False, 118),
 ('BsmtFinType1_none', True, 1),
 ('BsmtFinType2_BLQ', False, 68),
 ('BsmtFinType2_GLQ', False, 80),
 ('BsmtFinType2_LwQ', False, 93),
 ('BsmtFinType2_Rec', False, 92),
 ('BsmtFinType2_Unf', False, 69),
 ('BsmtFinType2_none', True, 1),
 ('Heating_GasA', True, 1),
 ('Heating_GasW', True, 1),
 ('Heating_Grav', True, 1),
 ('Heating_OthW', True, 1),
 ('Heating_Wall', True, 1),
 ('HeatingQC_Fa', False, 37),
 ('HeatingQC_Gd', False, 107),
 ('HeatingQC_Po', True, 1),
 ('HeatingQC_TA', False, 102),
 ('CentralAir_Y', False, 115),
 ('Electrical_FuseF', False, 99),
 ('Electrical_FuseP', False, 63),
 ('Electrical_Mix', True, 1),
 ('Electrical_SBrkr', False, 98),
 ('KitchenQual_Fa', True, 1),
 ('KitchenQual_Gd', True, 1),
 ('KitchenQual_TA', True, 1),
 ('Functional_Maj2', False, 29),
 ('Functional_Min1', False, 54),
 ('Functional_Min2', False, 94),
 ('Functional_Mod', True, 1),
 ('Functional_Sev', True, 1),
 ('Functional_Typ', True, 1),
 ('GarageType_Attchd', True, 1),
 ('GarageType_Basment', True, 1),
 ('GarageType_BuiltIn', True, 1),
 ('GarageType_CarPort', False, 3),
 ('GarageType_Detchd', True, 1),
 ('GarageType_none', True, 1),
 ('GarageFinish_RFn', False, 101),
 ('GarageFinish_Unf', False, 116),
 ('GarageFinish_none', True, 1),
 ('GarageQual_Fa', True, 1),
 ('GarageQual_Gd', True, 1),
 ('GarageQual_Po', True, 1),
 ('GarageQual_TA', True, 1),
 ('GarageQual_none', True, 1),
 ('GarageCond_Fa', True, 1),
 ('GarageCond_Gd', True, 1),
 ('GarageCond_Po', True, 1),
 ('GarageCond_TA', True, 1),
 ('GarageCond_none', True, 1),
 ('PavedDrive_P', False, 34),
 ('PavedDrive_Y', False, 104),
 ('SaleType_CWD', True, 1),
 ('SaleType_Con', True, 1),
 ('SaleType_ConLD', False, 30),
 ('SaleType_ConLI', False, 11),
 ('SaleType_ConLw', False, 123),
 ('SaleType_New', True, 1),
 ('SaleType_Oth', False, 43),
 ('SaleType_WD', False, 125),
 ('SaleCondition_AdjLand', False, 45),
 ('SaleCondition_Alloca', False, 105),
 ('SaleCondition_Family', True, 1),
 ('SaleCondition_Normal', False, 44),
 ('SaleCondition_Partial', False, 41)]
In [49]:
#List of columns selected by RFE
Rfe_Cols = X_train.columns[rfe.support_]
Rfe_Cols
Out[49]:
Index(['LotArea', 'OverallQual', 'OverallCond', 'BsmtUnfSF', 'BedroomAbvGr',
       'KitchenAbvGr', 'TotRmsAbvGrd', 'GarageArea', 'PoolArea', 'MiscVal',
       'Total_sqr_footage', 'Total_Bathrooms', 'Total_Age', 'MSZoning_FV',
       'MSZoning_RL', 'Street_Pave', 'Utilities_NoSeWa', 'LandSlope_Sev',
       'Neighborhood_Crawfor', 'Neighborhood_MeadowV', 'Neighborhood_Mitchel',
       'Neighborhood_NoRidge', 'Neighborhood_NridgHt', 'Neighborhood_Somerst',
       'Neighborhood_StoneBr', 'Condition1_Feedr', 'Condition1_Norm',
       'Condition1_PosN', 'Condition1_RRAe', 'Condition1_RRAn',
       'Condition1_RRNe', 'Condition2_PosN', 'Condition2_RRAe',
       'BldgType_Duplex', 'BldgType_Twnhs', 'BldgType_TwnhsE',
       'HouseStyle_2.5Fin', 'RoofStyle_Mansard', 'RoofStyle_Shed',
       'RoofMatl_Metal', 'RoofMatl_Tar&Grv', 'RoofMatl_WdShngl',
       'Exterior1st_AsphShn', 'Exterior1st_CBlock', 'Exterior1st_CemntBd',
       'Exterior1st_Stone', 'Exterior1st_Wd Sdng', 'Exterior2nd_AsphShn',
       'Exterior2nd_BrkFace', 'Exterior2nd_CBlock', 'Exterior2nd_CmentBd',
       'Exterior2nd_Other', 'Exterior2nd_Wd Sdng', 'ExterQual_Gd',
       'ExterQual_TA', 'ExterCond_Po', 'Foundation_Wood', 'BsmtQual_Fa',
       'BsmtQual_Gd', 'BsmtQual_TA', 'BsmtQual_none', 'BsmtCond_Po',
       'BsmtCond_none', 'BsmtExposure_Gd', 'BsmtExposure_none',
       'BsmtFinType1_none', 'BsmtFinType2_none', 'Heating_GasA',
       'Heating_GasW', 'Heating_Grav', 'Heating_OthW', 'Heating_Wall',
       'HeatingQC_Po', 'Electrical_Mix', 'KitchenQual_Fa', 'KitchenQual_Gd',
       'KitchenQual_TA', 'Functional_Mod', 'Functional_Sev', 'Functional_Typ',
       'GarageType_Attchd', 'GarageType_Basment', 'GarageType_BuiltIn',
       'GarageType_Detchd', 'GarageType_none', 'GarageFinish_none',
       'GarageQual_Fa', 'GarageQual_Gd', 'GarageQual_Po', 'GarageQual_TA',
       'GarageQual_none', 'GarageCond_Fa', 'GarageCond_Gd', 'GarageCond_Po',
       'GarageCond_TA', 'GarageCond_none', 'SaleType_CWD', 'SaleType_Con',
       'SaleType_New', 'SaleCondition_Family'],
      dtype='object')
In [50]:
#Creating X_train using RFE selected variables
#We are using the function of statsmodels here
X_train_rfe = X_train[Rfe_Cols] #X_train_rfe will now have all the RFE selected features
X_train_rfe = sm.add_constant(X_train_rfe) # adding the constant c to the variables to form the equation y = mx + c
X_train_rfe.shape
Out[50]:
(940, 101)
In [51]:
#Running the Model
lm = sm.OLS(y_train,X_train_rfe).fit()
In [52]:
#Stats summary of the model 
print(lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.929
Method:                 Least Squares   F-statistic:                     143.1
Date:                Sun, 21 Jan 2024   Prob (F-statistic):               0.00
Time:                        19:35:07   Log-Likelihood:                 1914.1
No. Observations:                 940   AIC:                            -3654.
Df Residuals:                     853   BIC:                            -3233.
Df Model:                          86                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.0700      0.042     -1.676      0.094      -0.152       0.012
LotArea                  0.0391      0.010      3.866      0.000       0.019       0.059
OverallQual              0.1117      0.015      7.623      0.000       0.083       0.141
OverallCond              0.0768      0.010      7.779      0.000       0.057       0.096
BsmtUnfSF                0.0734      0.008      9.253      0.000       0.058       0.089
BedroomAbvGr            -0.0718      0.014     -5.169      0.000      -0.099      -0.045
KitchenAbvGr            -0.0541      0.024     -2.214      0.027      -0.102      -0.006
TotRmsAbvGrd             0.0277      0.015      1.851      0.065      -0.002       0.057
GarageArea               0.0595      0.012      5.065      0.000       0.036       0.083
PoolArea                 0.0427      0.025      1.698      0.090      -0.007       0.092
MiscVal                  0.1449      0.055      2.634      0.009       0.037       0.253
Total_sqr_footage        0.2904      0.015     19.462      0.000       0.261       0.320
Total_Bathrooms          0.0436      0.013      3.478      0.001       0.019       0.068
Total_Age               -0.0884      0.012     -7.377      0.000      -0.112      -0.065
MSZoning_FV             -0.0161      0.011     -1.432      0.153      -0.038       0.006
MSZoning_RL              0.0071      0.004      1.802      0.072      -0.001       0.015
Street_Pave              0.1042      0.027      3.918      0.000       0.052       0.156
Utilities_NoSeWa     -2.046e-16   5.07e-17     -4.035      0.000   -3.04e-16   -1.05e-16
LandSlope_Sev           -0.0370      0.024     -1.512      0.131      -0.085       0.011
Neighborhood_Crawfor     0.0353      0.007      5.126      0.000       0.022       0.049
Neighborhood_MeadowV    -0.0152      0.014     -1.068      0.286      -0.043       0.013
Neighborhood_Mitchel    -0.0167      0.006     -2.650      0.008      -0.029      -0.004
Neighborhood_NoRidge     0.0476      0.008      5.742      0.000       0.031       0.064
Neighborhood_NridgHt     0.0504      0.007      7.317      0.000       0.037       0.064
Neighborhood_Somerst     0.0415      0.010      4.162      0.000       0.022       0.061
Neighborhood_StoneBr     0.0888      0.011      8.087      0.000       0.067       0.110
Condition1_Feedr         0.0214      0.008      2.815      0.005       0.006       0.036
Condition1_Norm          0.0358      0.006      5.791      0.000       0.024       0.048
Condition1_PosN          0.0473      0.013      3.696      0.000       0.022       0.072
Condition1_RRAe         -0.0054      0.013     -0.412      0.680      -0.031       0.020
Condition1_RRAn          0.0227      0.010      2.238      0.026       0.003       0.043
Condition1_RRNe          0.0314      0.034      0.927      0.354      -0.035       0.098
Condition2_PosN         -0.0781      0.037     -2.131      0.033      -0.150      -0.006
Condition2_RRAe         -0.0988      0.033     -3.016      0.003      -0.163      -0.035
BldgType_Duplex         -0.0355      0.009     -3.926      0.000      -0.053      -0.018
BldgType_Twnhs          -0.0387      0.008     -4.711      0.000      -0.055      -0.023
BldgType_TwnhsE         -0.0265      0.006     -4.771      0.000      -0.037      -0.016
HouseStyle_2.5Fin       -0.0666      0.021     -3.143      0.002      -0.108      -0.025
RoofStyle_Mansard        0.0181      0.017      1.057      0.291      -0.016       0.052
RoofStyle_Shed          -0.0988      0.033     -3.016      0.003      -0.163      -0.035
RoofMatl_Metal           0.0374      0.042      0.886      0.376      -0.046       0.120
RoofMatl_Tar&Grv         0.0088      0.026      0.332      0.740      -0.043       0.061
RoofMatl_WdShngl        -0.0324      0.035     -0.935      0.350      -0.100       0.036
Exterior1st_AsphShn   8.144e-18   4.59e-17      0.178      0.859   -8.19e-17    9.82e-17
Exterior1st_CBlock      -0.0087      0.018     -0.494      0.621      -0.043       0.026
Exterior1st_CemntBd     -0.0330      0.025     -1.333      0.183      -0.082       0.016
Exterior1st_Stone       -0.0128      0.035     -0.365      0.715      -0.081       0.056
Exterior1st_Wd Sdng     -0.0178      0.007     -2.377      0.018      -0.033      -0.003
Exterior2nd_AsphShn  -1.417e-17   2.05e-17     -0.691      0.490   -5.44e-17    2.61e-17
Exterior2nd_BrkFace      0.0189      0.010      1.944      0.052      -0.000       0.038
Exterior2nd_CBlock      -0.0087      0.018     -0.494      0.621      -0.043       0.026
Exterior2nd_CmentBd      0.0538      0.025      2.145      0.032       0.005       0.103
Exterior2nd_Other       -0.0332      0.035     -0.938      0.349      -0.103       0.036
Exterior2nd_Wd Sdng      0.0210      0.007      2.818      0.005       0.006       0.036
ExterQual_Gd            -0.0133      0.008     -1.688      0.092      -0.029       0.002
ExterQual_TA            -0.0217      0.008     -2.639      0.008      -0.038      -0.006
ExterCond_Po             0.0605      0.039      1.567      0.118      -0.015       0.136
Foundation_Wood         -0.0519      0.034     -1.511      0.131      -0.119       0.016
BsmtQual_Fa             -0.0520      0.010     -5.013      0.000      -0.072      -0.032
BsmtQual_Gd             -0.0516      0.006     -8.662      0.000      -0.063      -0.040
BsmtQual_TA             -0.0521      0.007     -7.321      0.000      -0.066      -0.038
BsmtQual_none           -0.0014      0.009     -0.164      0.869      -0.019       0.016
BsmtCond_Po              0.0222      0.044      0.507      0.612      -0.064       0.108
BsmtCond_none           -0.0014      0.009     -0.164      0.869      -0.019       0.016
BsmtExposure_Gd          0.0343      0.005      7.268      0.000       0.025       0.044
BsmtExposure_none       -0.0249      0.033     -0.747      0.455      -0.090       0.041
BsmtFinType1_none       -0.0014      0.009     -0.164      0.869      -0.019       0.016
BsmtFinType2_none       -0.0014      0.009     -0.164      0.869      -0.019       0.016
Heating_GasA             0.0023      0.012      0.194      0.846      -0.021       0.026
Heating_GasW            -0.0155      0.015     -1.026      0.305      -0.045       0.014
Heating_Grav            -0.0270      0.026     -1.053      0.293      -0.077       0.023
Heating_OthW            -0.0542      0.031     -1.772      0.077      -0.114       0.006
Heating_Wall             0.0243      0.021      1.143      0.253      -0.017       0.066
HeatingQC_Po          -2.81e-18   1.05e-17     -0.268      0.789   -2.34e-17    1.78e-17
Electrical_Mix          -0.0184      0.061     -0.300      0.764      -0.139       0.102
KitchenQual_Fa          -0.0364      0.011     -3.460      0.001      -0.057      -0.016
KitchenQual_Gd          -0.0366      0.007     -5.488      0.000      -0.050      -0.024
KitchenQual_TA          -0.0380      0.007     -5.188      0.000      -0.052      -0.024
Functional_Mod          -0.0294      0.017     -1.770      0.077      -0.062       0.003
Functional_Sev          -0.1623      0.044     -3.670      0.000      -0.249      -0.076
Functional_Typ           0.0240      0.005      4.559      0.000       0.014       0.034
GarageType_Attchd        0.0105      0.012      0.886      0.376      -0.013       0.034
GarageType_Basment       0.0115      0.017      0.686      0.493      -0.021       0.044
GarageType_BuiltIn       0.0188      0.013      1.449      0.148      -0.007       0.044
GarageType_Detchd        0.0180      0.012      1.553      0.121      -0.005       0.041
GarageType_none          0.0011      0.009      0.115      0.908      -0.017       0.019
GarageFinish_none        0.0011      0.009      0.115      0.908      -0.017       0.019
GarageQual_Fa           -0.0010      0.021     -0.046      0.963      -0.042       0.040
GarageQual_Gd            0.0215      0.023      0.948      0.343      -0.023       0.066
GarageQual_Po           -0.0766      0.038     -1.996      0.046      -0.152      -0.001
GarageQual_TA            0.0093      0.021      0.451      0.652      -0.031       0.050
GarageQual_none          0.0011      0.009      0.115      0.908      -0.017       0.019
GarageCond_Fa           -0.0255      0.021     -1.217      0.224      -0.067       0.016
GarageCond_Gd           -0.0361      0.026     -1.403      0.161      -0.087       0.014
GarageCond_Po            0.0391      0.040      0.987      0.324      -0.039       0.117
GarageCond_TA           -0.0244      0.020     -1.193      0.233      -0.064       0.016
GarageCond_none          0.0011      0.009      0.115      0.908      -0.017       0.019
SaleType_CWD             0.0230      0.020      1.132      0.258      -0.017       0.063
SaleType_Con             0.1121      0.035      3.181      0.002       0.043       0.181
SaleType_New             0.0301      0.005      5.967      0.000       0.020       0.040
SaleCondition_Family    -0.0282      0.009     -2.982      0.003      -0.047      -0.010
==============================================================================
Omnibus:                      272.441   Durbin-Watson:                   2.052
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             3475.595
Skew:                           0.947   Prob(JB):                         0.00
Kurtosis:                      12.228   Cond. No.                     1.06e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.96e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [53]:
#Listing VIF of all columns
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns #Column Names
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])] #VIF value
vif['VIF'] = round(vif['VIF'], 2) #Rounding to 2 decimal places
vif = vif.sort_values(by = "VIF", ascending = False) #arranging in decending order
vif
Out[53]:
Features VIF
95 GarageCond_TA inf
33 Condition2_RRAe inf
96 GarageCond_none inf
44 Exterior1st_CBlock inf
93 GarageCond_Gd inf
50 Exterior2nd_CBlock inf
94 GarageCond_Po inf
61 BsmtQual_none inf
63 BsmtCond_none inf
66 BsmtFinType1_none inf
67 BsmtFinType2_none inf
68 Heating_GasA inf
69 Heating_GasW inf
70 Heating_Grav inf
71 Heating_OthW inf
72 Heating_Wall inf
85 GarageType_none inf
86 GarageFinish_none inf
87 GarageQual_Fa inf
88 GarageQual_Gd inf
89 GarageQual_Po inf
90 GarageQual_TA inf
91 GarageQual_none inf
39 RoofStyle_Shed inf
92 GarageCond_Fa inf
81 GarageType_Attchd 28.80
65 BsmtExposure_none 27.49
84 GarageType_Detchd 23.11
51 Exterior2nd_CmentBd 21.39
45 Exterior1st_CemntBd 20.86
55 ExterQual_TA 13.43
54 ExterQual_Gd 11.65
77 KitchenQual_TA 11.48
60 BsmtQual_TA 10.70
76 KitchenQual_Gd 9.16
59 BsmtQual_Gd 7.45
83 GarageType_BuiltIn 7.36
13 Total_Age 6.72
11 Total_sqr_footage 6.03
47 Exterior1st_Wd Sdng 5.46
53 Exterior2nd_Wd Sdng 5.38
24 Neighborhood_Somerst 5.00
14 MSZoning_FV 4.95
7 TotRmsAbvGrd 4.40
2 OverallQual 4.11
10 MiscVal 3.87
27 Condition1_Norm 3.79
62 BsmtCond_Po 3.48
74 Electrical_Mix 3.42
12 Total_Bathrooms 3.29
8 GarageArea 3.27
6 KitchenAbvGr 2.83
34 BldgType_Duplex 2.65
75 KitchenQual_Fa 2.64
5 BedroomAbvGr 2.64
26 Condition1_Feedr 2.63
58 BsmtQual_Fa 2.57
1 LotArea 2.43
15 MSZoning_RL 2.29
82 GarageType_Basment 2.27
18 LandSlope_Sev 2.17
4 BsmtUnfSF 2.16
41 RoofMatl_Tar&Grv 1.91
36 BldgType_TwnhsE 1.89
79 Functional_Sev 1.78
23 Neighborhood_NridgHt 1.77
30 Condition1_RRAn 1.65
20 Neighborhood_MeadowV 1.64
40 RoofMatl_Metal 1.62
3 OverallCond 1.62
99 SaleType_New 1.52
78 Functional_Mod 1.50
35 BldgType_Twnhs 1.50
28 Condition1_PosN 1.47
80 Functional_Typ 1.43
29 Condition1_RRAe 1.38
19 Neighborhood_Crawfor 1.37
64 BsmtExposure_Gd 1.37
56 ExterCond_Po 1.36
25 Neighborhood_StoneBr 1.30
16 Street_Pave 1.28
32 Condition2_PosN 1.22
22 Neighborhood_NoRidge 1.22
37 HouseStyle_2.5Fin 1.22
49 Exterior2nd_BrkFace 1.19
21 Neighborhood_Mitchel 1.19
52 Exterior2nd_Other 1.14
98 SaleType_Con 1.13
97 SaleType_CWD 1.13
100 SaleCondition_Family 1.13
46 Exterior1st_Stone 1.11
42 RoofMatl_WdShngl 1.09
57 Foundation_Wood 1.07
38 RoofStyle_Mansard 1.06
31 Condition1_RRNe 1.05
9 PoolArea 1.03
0 const 0.00
17 Utilities_NoSeWa NaN
43 Exterior1st_AsphShn NaN
48 Exterior2nd_AsphShn NaN
73 HeatingQC_Po NaN
In [54]:
# Many variables have high correlation
In [55]:
rfe = RFE(lr)
rfe.fit(X_train_rfe,y_train)
col = X_train_rfe.columns[rfe.support_]
X_train_rfe = X_train_rfe[col]
X_train_rfe = sm.add_constant(X_train_rfe)
lm = sm.OLS(y_train,X_train_rfe).fit()
print(lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.915
Model:                            OLS   Adj. R-squared:                  0.911
Method:                 Least Squares   F-statistic:                     225.6
Date:                Sun, 21 Jan 2024   Prob (F-statistic):               0.00
Time:                        19:35:10   Log-Likelihood:                 1789.2
No. Observations:                 940   AIC:                            -3490.
Df Residuals:                     896   BIC:                            -3277.
Df Model:                          43                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                 2.213e+09   1.44e+11      0.015      0.988    -2.8e+11    2.85e+11
OverallQual              0.1325      0.015      8.871      0.000       0.103       0.162
OverallCond              0.0796      0.010      7.607      0.000       0.059       0.100
BsmtUnfSF                0.0830      0.008     10.302      0.000       0.067       0.099
BedroomAbvGr            -0.0809      0.013     -6.225      0.000      -0.106      -0.055
KitchenAbvGr            -0.1384      0.019     -7.242      0.000      -0.176      -0.101
GarageArea               0.0683      0.012      5.741      0.000       0.045       0.092
MiscVal                  0.1199      0.059      2.033      0.042       0.004       0.236
Total_sqr_footage        0.3060      0.013     22.687      0.000       0.279       0.332
Total_Bathrooms          0.0438      0.014      3.204      0.001       0.017       0.071
Total_Age               -0.0977      0.011     -9.052      0.000      -0.119      -0.077
Street_Pave              0.0726      0.027      2.669      0.008       0.019       0.126
Neighborhood_Crawfor     0.0403      0.007      5.635      0.000       0.026       0.054
Neighborhood_NoRidge     0.0480      0.009      5.324      0.000       0.030       0.066
Neighborhood_NridgHt     0.0465      0.007      6.537      0.000       0.033       0.060
Neighborhood_StoneBr     0.0877      0.011      7.733      0.000       0.065       0.110
Condition2_PosN         -0.0655      0.038     -1.727      0.084      -0.140       0.009
Condition2_RRAe      -5.202e+06   3.39e+08     -0.015      0.988    -6.7e+08    6.59e+08
BldgType_Twnhs          -0.0501      0.008     -6.350      0.000      -0.066      -0.035
BldgType_TwnhsE         -0.0391      0.005     -7.405      0.000      -0.050      -0.029
HouseStyle_2.5Fin       -0.0628      0.023     -2.699      0.007      -0.108      -0.017
RoofStyle_Shed        5.202e+06   3.39e+08      0.015      0.988   -6.59e+08     6.7e+08
Exterior1st_CBlock    7.813e+06   5.08e+08      0.015      0.988    -9.9e+08    1.01e+09
Exterior2nd_CBlock   -7.813e+06   5.08e+08     -0.015      0.988   -1.01e+09     9.9e+08
Foundation_Wood         -0.0640      0.038     -1.701      0.089      -0.138       0.010
BsmtQual_Fa             -0.0580      0.011     -5.242      0.000      -0.080      -0.036
BsmtQual_Gd             -0.0559      0.006     -8.903      0.000      -0.068      -0.044
BsmtQual_TA             -0.0596      0.007     -7.963      0.000      -0.074      -0.045
BsmtQual_none        -8.876e+06   5.78e+08     -0.015      0.988   -1.14e+09    1.12e+09
BsmtCond_none         5.548e+06   3.61e+08      0.015      0.988   -7.03e+08    7.14e+08
BsmtFinType1_none    -2.289e+06   1.49e+08     -0.015      0.988   -2.95e+08     2.9e+08
BsmtFinType2_none     5.617e+06   3.66e+08      0.015      0.988   -7.12e+08    7.23e+08
Heating_GasA         -2.213e+09   1.44e+11     -0.015      0.988   -2.85e+11     2.8e+11
Heating_GasW         -2.213e+09   1.44e+11     -0.015      0.988   -2.85e+11     2.8e+11
Heating_Grav         -2.213e+09   1.44e+11     -0.015      0.988   -2.85e+11     2.8e+11
Heating_OthW         -2.213e+09   1.44e+11     -0.015      0.988   -2.85e+11     2.8e+11
Heating_Wall         -2.213e+09   1.44e+11     -0.015      0.988   -2.85e+11     2.8e+11
KitchenQual_Fa          -0.0562      0.011     -5.179      0.000      -0.078      -0.035
KitchenQual_Gd          -0.0502      0.007     -7.494      0.000      -0.063      -0.037
KitchenQual_TA          -0.0586      0.007     -8.032      0.000      -0.073      -0.044
Functional_Sev          -0.1762      0.037     -4.712      0.000      -0.250      -0.103
GarageQual_Fa        -8.906e+08    5.8e+10     -0.015      0.988   -1.15e+11    1.13e+11
GarageQual_Gd        -8.906e+08    5.8e+10     -0.015      0.988   -1.15e+11    1.13e+11
GarageQual_Po        -8.906e+08    5.8e+10     -0.015      0.988   -1.15e+11    1.13e+11
GarageQual_TA        -8.906e+08    5.8e+10     -0.015      0.988   -1.15e+11    1.13e+11
GarageCond_Fa         8.906e+08    5.8e+10      0.015      0.988   -1.13e+11    1.15e+11
GarageCond_Gd         8.906e+08    5.8e+10      0.015      0.988   -1.13e+11    1.15e+11
GarageCond_Po         8.906e+08    5.8e+10      0.015      0.988   -1.13e+11    1.15e+11
GarageCond_TA         8.906e+08    5.8e+10      0.015      0.988   -1.13e+11    1.15e+11
SaleType_Con             0.1669      0.038      4.438      0.000       0.093       0.241
SaleType_New             0.0353      0.005      6.614      0.000       0.025       0.046
==============================================================================
Omnibus:                      248.804   Durbin-Watson:                   2.046
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2677.623
Skew:                           0.884   Prob(JB):                         0.00
Kurtosis:                      11.077   Cond. No.                     1.09e+16
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 5.6e-29. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
In [56]:
#Listing VIF of all columns
vif = pd.DataFrame()
vif['Features'] = X_train_rfe.columns #Column Names
vif['VIF'] = [variance_inflation_factor(X_train_rfe.values, i) for i in range(X_train_rfe.shape[1])] #VIF value
vif['VIF'] = round(vif['VIF'], 2) #Rounding to 2 decimal places
vif = vif.sort_values(by = "VIF", ascending = False) #arranging in decending order
vif
Out[56]:
Features VIF
45 GarageCond_Fa inf
22 Exterior1st_CBlock inf
17 Condition2_RRAe inf
44 GarageQual_TA inf
21 RoofStyle_Shed inf
32 Heating_GasA inf
46 GarageCond_Gd inf
47 GarageCond_Po inf
48 GarageCond_TA inf
23 Exterior2nd_CBlock inf
42 GarageQual_Gd inf
41 GarageQual_Fa inf
33 Heating_GasW inf
34 Heating_Grav inf
35 Heating_OthW inf
36 Heating_Wall inf
43 GarageQual_Po inf
28 BsmtQual_none 6145824.33
29 BsmtCond_none 6145824.33
30 BsmtFinType1_none 6145824.33
31 BsmtFinType2_none 6145824.33
27 BsmtQual_TA 9.53
39 KitchenQual_TA 9.16
38 KitchenQual_Gd 7.44
26 BsmtQual_Gd 6.66
10 Total_Age 4.38
8 Total_sqr_footage 3.97
7 MiscVal 3.58
1 OverallQual 3.44
9 Total_Bathrooms 3.15
6 GarageArea 2.70
25 BsmtQual_Fa 2.35
37 KitchenQual_Fa 2.26
4 BedroomAbvGr 1.86
3 BsmtUnfSF 1.79
14 Neighborhood_NridgHt 1.52
2 OverallCond 1.47
5 KitchenAbvGr 1.39
19 BldgType_TwnhsE 1.38
50 SaleType_New 1.37
20 HouseStyle_2.5Fin 1.19
12 Neighborhood_Crawfor 1.19
13 Neighborhood_NoRidge 1.17
15 Neighborhood_StoneBr 1.12
18 BldgType_Twnhs 1.11
11 Street_Pave 1.08
16 Condition2_PosN 1.05
24 Foundation_Wood 1.04
49 SaleType_Con 1.03
40 Functional_Sev 1.02
0 const 0.00
In [57]:
rfe=rfe.fit(X_train_rfe,y_train)
In [58]:
print(X_train_rfe.shape)
print(y_train.shape)
(940, 51)
(940,)
In [59]:
y_train_pred = rfe.predict(X_train_rfe)
In [60]:
## Train score
In [61]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
#Returns the mean squared error; we'll take a square root
np.sqrt(mean_squared_error(y_train, y_train_pred))
Out[61]:
0.040582976524661515

On average, the squared difference between the actual and predicted values is On average, the squared difference between the actual and predicted values is 5.22.

In [62]:
r_squared = r2_score(y_train, y_train_pred)
r_squared
Out[62]:
0.892961031888519
In [63]:
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train_rfe.columns]
print(len(columns_to_drop))
print(X_test.shape)
174
(403, 224)
In [64]:
## Test score
In [65]:
X_test_rfe = X_test.drop(columns=columns_to_drop)
X_test_rfe.shape
Out[65]:
(403, 50)
In [66]:
rfe=rfe.fit(X_test_rfe,y_test)
y_test_pred = rfe.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0.03943445902844407
0.9024503770845567
In [ ]:
 
In [ ]:
 

Model 2: Ridge and Lasso¶

In [67]:
## lets perform Rige and Lasso on the columns given by rfe

Ridge - Regularization¶

In [68]:
print(X_train_rfe.shape)
print(y_train.shape)
X_train = X_train_rfe
(940, 51)
(940,)
In [69]:
# Considering following alphas
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()

# cross validation
folds = 5
ridge_model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
ridge_model_cv.fit(X_train, y_train)
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Out[69]:
GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0,
                                   6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500,
                                   1000]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=Ridge(),
             param_grid={'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5,
                                   0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0,
                                   6.0, 7.0, 8.0, 9.0, 10.0, 20, 50, 100, 500,
                                   1000]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
Ridge()
Ridge()
In [70]:
ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha']<=500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
Out[70]:
param_alpha mean_train_score mean_test_score rank_test_score
12 1.0 -0.025526 -0.027323 1
11 0.9 -0.025489 -0.027327 2
10 0.8 -0.025451 -0.027334 3
9 0.7 -0.025413 -0.027346 4
8 0.6 -0.025376 -0.027366 5
7 0.5 -0.025339 -0.027397 6
13 2.0 -0.025916 -0.027405 7
6 0.4 -0.025302 -0.027437 8
5 0.3 -0.025264 -0.027487 9
4 0.2 -0.025223 -0.027550 10
14 3.0 -0.026298 -0.027628 11
3 0.1 -0.025181 -0.027636 12
2 0.01 -0.025145 -0.027752 13
1 0.001 -0.025144 -0.027767 14
0 0.0001 -0.025144 -0.027768 15
15 4.0 -0.026668 -0.027879 16
16 5.0 -0.027050 -0.028167 17
17 6.0 -0.027436 -0.028490 18
18 7.0 -0.027829 -0.028839 19
19 8.0 -0.028237 -0.029198 20
20 9.0 -0.028652 -0.029593 21
21 10.0 -0.029070 -0.029998 22
22 20 -0.032929 -0.033775 23
23 50 -0.041236 -0.041909 24
24 100 -0.049096 -0.049710 25
25 500 -0.067564 -0.067853 26
In [ ]:
 
In [71]:
# plotting Negative Mean Absolute Error vs alpha for train and test

ridge_cv_results['param_alpha'] = ridge_cv_results['param_alpha'].astype('int32')

plt.figure(figsize=(8,5))
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'])
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Ridge regression)')
plt.title("Negative Mean Absolute Error and alpha\n",fontsize=15)
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
No description has been provided for this image
In [72]:
ridge_model_cv.best_params_
Out[72]:
{'alpha': 1.0}
In [73]:
## Train score
In [74]:
# Hyperparameter lambda = 1.0
alpha = 1.0
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
#ridge.coef_
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, ridge.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = ridge.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is  0.0013402777692339
The r2 value of train data is  0.9128938268574391
In [75]:
## Test score
In [76]:
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train.columns]
print(len(columns_to_drop))
print(X_test.shape)
X_test_rfe = X_test.drop(columns=columns_to_drop)
ridge.fit(X_test_rfe,y_test)
y_test_pred = ridge.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
174
(403, 224)
0.039247725758071256
0.9033720396193562
In [77]:
model_parameter = list(ridge.coef_)
model_parameter.insert(0,ridge.intercept_)
cols = X_test.columns
cols.insert(0,'constant')
ridge_coef = pd.DataFrame(list(zip(cols,model_parameter)))
ridge_coef.columns = ['Feaure','Coef']
In [78]:
ridge_coef.sort_values(by='Coef',ascending=False).head(10)
Out[78]:
Feaure Coef
8 LowQualFinSF 0.237707
1 LotFrontage 0.181180
12 Fireplaces 0.093996
0 MSSubClass 0.090422
2 LotArea 0.079129
6 BsmtUnfSF 0.062856
15 PoolArea 0.056957
13 GarageCars 0.054825
9 BedroomAbvGr 0.051661
14 GarageArea 0.046775

9.3 Lasso - Regularization¶

In [79]:
lasso = Lasso()

# Considering following alphas
params = {'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01]}

# cross validation
folds = 5
lasso_model_cv = GridSearchCV(estimator = lasso,                         
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            

lasso_model_cv.fit(X_train, y_train) 
Fitting 5 folds for each of 11 candidates, totalling 55 fits
Out[79]:
GridSearchCV(cv=5, estimator=Lasso(),
             param_grid={'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005,
                                   0.001, 0.002, 0.003, 0.004, 0.005, 0.01]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5, estimator=Lasso(),
             param_grid={'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005,
                                   0.001, 0.002, 0.003, 0.004, 0.005, 0.01]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)
Lasso()
Lasso()
In [80]:
lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results = lasso_cv_results[lasso_cv_results['param_alpha']<=500]
lasso_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
Out[80]:
param_alpha mean_train_score mean_test_score rank_test_score
0 0.0001 -0.025815 -0.027356 1
1 0.0002 -0.026394 -0.027602 2
2 0.0003 -0.026897 -0.028060 3
3 0.0004 -0.027545 -0.028690 4
4 0.0005 -0.028305 -0.029426 5
5 0.001 -0.031726 -0.032461 6
6 0.002 -0.034447 -0.034865 7
7 0.003 -0.036620 -0.036985 8
8 0.004 -0.039353 -0.039719 9
9 0.005 -0.042696 -0.043141 10
10 0.01 -0.060830 -0.061227 11
In [81]:
# plotting Negative Mean Absolute Error vs alpha for train and test

lasso_cv_results['param_alpha'] = lasso_cv_results['param_alpha'].astype('float64')

plt.figure(figsize=(10,8))
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_train_score'])
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Lasso regression)')

plt.title("Negative Mean Absolute Error and alpha (Lasso regression)")
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
No description has been provided for this image
In [82]:
# lambda best estimator
lasso_model_cv.best_estimator_
Out[82]:
Lasso(alpha=0.0001)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.0001)
In [83]:
# # Hyperparameter lambda = 0001

alpha = 0.0001

lasso = Lasso(alpha=alpha)
        
lasso.fit(X_train, y_train) 
lasso.coef_
Out[83]:
array([ 0.        ,  0.14936811,  0.07683105,  0.07566492, -0.07053207,
       -0.11401635,  0.05506468, -0.        ,  0.29853129,  0.04153133,
       -0.10228039,  0.01518065,  0.03516087,  0.04418029,  0.04738503,
        0.08101888, -0.        , -0.        , -0.04706271, -0.03554114,
       -0.02340574, -0.        , -0.        , -0.        , -0.        ,
       -0.04301898, -0.04935221, -0.05051161, -0.02254578, -0.        ,
       -0.        , -0.        ,  0.00277795, -0.00666772, -0.        ,
       -0.        ,  0.        , -0.04908881, -0.04489768, -0.05603447,
       -0.07706967, -0.00870119,  0.00747869, -0.        , -0.00106238,
       -0.        , -0.        , -0.        , -0.        ,  0.06879055,
        0.03647498])
In [84]:
## X_train score
In [85]:
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, lasso.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is  0.0013669680543668022
The r2 value of train data is  0.9111591949390572
In [86]:
## test score
In [87]:
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train.columns]
print(len(columns_to_drop))
print(X_test.shape)
X_test_rfe = X_test.drop(columns=columns_to_drop)
lasso.fit(X_test_rfe,y_test)
y_test_pred = lasso.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
174
(403, 224)
0.03895489153666429
0.9048085770170703
In [88]:
model_param = list(lasso.coef_)
model_param.insert(0,lasso.intercept_)
cols = df_train.columns
cols.insert(0,'const')
lasso_coef = pd.DataFrame(list(zip(cols,model_param)))
lasso_coef.columns = ['Featuere','Coef']
In [89]:
lasso_coef.sort_values(by='Coef',ascending=False).head(10)
Out[89]:
Featuere Coef
8 LowQualFinSF 0.281284
1 LotFrontage 0.211622
12 Fireplaces 0.090464
2 LotArea 0.089172
0 MSSubClass 0.075230
6 BsmtUnfSF 0.053667
15 PoolArea 0.051527
3 OverallQual 0.045568
13 GarageCars 0.044817
14 GarageArea 0.043813

Model 3: Lets perform ridge and Lasso without rfe¶

In [90]:
#Here we will keep the train size as 70% and automatically test size will be rest 30%
#Also we will keep random_state as a fixed value of 100 so that the data set does no changes
df_train,df_test = train_test_split(df_house, train_size = 0.7, random_state = 100)
print ("The Size of Train data is",df_train.shape)
print ("The Size of Test data is",df_test.shape)
Scaler = MinMaxScaler() # Instantiate an objectr
#Note-The above order of columns in num_cols should be same in df_test, otherwise we will get a wrong r-square value
df_train[Num_cols] = Scaler.fit_transform(df_train[Num_cols])
df_test[Num_cols] = Scaler.transform(df_test[Num_cols])
#Define X_train and y_train
y_train = df_train.pop('SalePrice') #This contains only the Target Variable
X_train = df_train #This contains all the Independent Variables except the Target Variable
#Since 'SalePrice' is the target variable we will keep it only on y-train and remove it from X_train
y_test = df_test.pop('SalePrice') #This contains only the Target Variable
X_test = df_test #This contains all the Independent Variables except the Target Variable
The Size of Train data is (940, 225)
The Size of Test data is (403, 225)
In [91]:
# Considering following alphas
params = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 
                    9.0, 10.0, 20, 50, 100, 500, 1000 ]}

ridge = Ridge()

# cross validation
folds = 5
ridge_model_cv = GridSearchCV(estimator = ridge, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            
ridge_model_cv.fit(X_train, y_train)     
ridge_cv_results = pd.DataFrame(ridge_model_cv.cv_results_)
ridge_cv_results = ridge_cv_results[ridge_cv_results['param_alpha']<=500]
ridge_cv_results[['param_alpha', 'mean_train_score', 'mean_test_score', 'rank_test_score']].sort_values(by = ['rank_test_score'])
Fitting 5 folds for each of 27 candidates, totalling 135 fits
Out[91]:
param_alpha mean_train_score mean_test_score rank_test_score
14 3.0 -0.020806 -0.025553 1
13 2.0 -0.020430 -0.025573 2
15 4.0 -0.021161 -0.025599 3
16 5.0 -0.021491 -0.025705 4
17 6.0 -0.021807 -0.025842 5
12 1.0 -0.020001 -0.025907 6
11 0.9 -0.019951 -0.025985 7
18 7.0 -0.022118 -0.025992 8
10 0.8 -0.019897 -0.026080 9
19 8.0 -0.022414 -0.026167 10
9 0.7 -0.019840 -0.026206 11
8 0.6 -0.019779 -0.026355 12
20 9.0 -0.022699 -0.026366 13
7 0.5 -0.019715 -0.026531 14
21 10.0 -0.022971 -0.026571 15
6 0.4 -0.019642 -0.026746 16
5 0.3 -0.019561 -0.027049 17
4 0.2 -0.019477 -0.027454 18
3 0.1 -0.019372 -0.028063 19
22 20 -0.025374 -0.028461 20
2 0.01 -0.019243 -0.029159 21
1 0.001 -0.019231 -0.029368 22
0 0.0001 -0.019230 -0.029392 23
23 50 -0.030367 -0.032800 24
24 100 -0.035664 -0.037605 25
25 500 -0.050757 -0.051781 26
In [92]:
# plotting Negative Mean Absolute Error vs alpha for train and test

ridge_cv_results['param_alpha'] = ridge_cv_results['param_alpha'].astype('int32')

plt.figure(figsize=(8,5))
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_train_score'])
plt.plot(ridge_cv_results['param_alpha'], ridge_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Ridge regression)')
plt.title("Negative Mean Absolute Error and alpha\n",fontsize=15)
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
No description has been provided for this image
In [93]:
# lambda best estimator
ridge_model_cv.best_estimator_
Out[93]:
Ridge(alpha=3.0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge(alpha=3.0)
In [94]:
# Hyperparameter lambda = 1.0
alpha = 3.0
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
#ridge.coef_
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, ridge.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = ridge.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is  0.0009472129960129189
The r2 value of train data is  0.9384395525110093
In [95]:
ridge.fit(X_test,y_test)
y_test_pred = ridge.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0.03115501423756808
0.9391122777021567
In [96]:
ridge.fit(X_test,y_test)
y_test_pred = ridge.predict(X_test)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0.03115501423756808
0.9391122777021567
In [97]:
lasso = Lasso()

# Considering following alphas
params = {'alpha': [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.01]}

# cross validation
folds = 5
lasso_model_cv = GridSearchCV(estimator = lasso,                         
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = folds, 
                        return_train_score=True,
                        verbose = 1)            

lasso_model_cv.fit(X_train, y_train) 
lasso_cv_results = pd.DataFrame(lasso_model_cv.cv_results_)
lasso_cv_results = lasso_cv_results[lasso_cv_results['param_alpha']<=500]
# plotting Negative Mean Absolute Error vs alpha for train and test

lasso_cv_results['param_alpha'] = lasso_cv_results['param_alpha'].astype('float64')

plt.figure(figsize=(10,8))
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_train_score'])
plt.plot(lasso_cv_results['param_alpha'], lasso_cv_results['mean_test_score'])
plt.xlabel('alpha parameter')
plt.ylabel('Negative Mean Absolute Error (Lasso regression)')

plt.title("Negative Mean Absolute Error and alpha (Lasso regression)")
plt.legend(['train score', 'test score'], loc='upper right')
plt.show()
Fitting 5 folds for each of 11 candidates, totalling 55 fits
No description has been provided for this image
In [98]:
# lambda best estimator
lasso_model_cv.best_estimator_
Out[98]:
Lasso(alpha=0.0001)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.0001)
In [99]:
# # Hyperparameter lambda = 0001

alpha = 0.0001

lasso = Lasso(alpha=alpha)
        
lasso.fit(X_train, y_train) 
lasso.coef_
Out[99]:
array([-3.47046277e-02,  8.01568342e-03,  3.42401618e-02,  1.23318454e-01,
        6.82283981e-02,  0.00000000e+00,  3.53450845e-02,  5.59990563e-02,
       -0.00000000e+00, -2.58846205e-02, -0.00000000e+00,  2.32952266e-02,
        1.53778127e-02,  6.08378698e-03,  4.33094507e-02,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -9.54102984e-04,  2.24896319e-01,
        3.38439466e-02,  1.69408311e-02, -6.42945323e-02, -0.00000000e+00,
       -1.13202566e-02,  0.00000000e+00,  0.00000000e+00,  1.50910996e-02,
        0.00000000e+00,  2.23823656e-02,  1.29270450e-03, -0.00000000e+00,
        2.31426375e-03,  0.00000000e+00, -1.73552569e-02, -4.21352878e-03,
        0.00000000e+00,  1.13113534e-02, -6.94123098e-03, -0.00000000e+00,
        0.00000000e+00, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        9.26601542e-05,  1.23042610e-02,  0.00000000e+00, -2.59796427e-03,
        2.11858172e-02, -1.09035173e-02, -1.14740138e-02,  0.00000000e+00,
       -0.00000000e+00, -2.20161775e-02, -1.13770719e-02,  0.00000000e+00,
       -7.45555850e-03,  4.01461357e-02,  4.22125412e-02, -9.74129425e-03,
       -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,  2.96928953e-02,
        7.36635160e-02, -3.22058254e-03,  1.33742913e-02,  2.84363743e-03,
        1.88417586e-02, -0.00000000e+00,  8.82123126e-03, -2.28903958e-02,
        2.16786363e-03,  0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -2.71724364e-02, -5.31562868e-03,
       -0.00000000e+00,  0.00000000e+00, -4.21334072e-03, -1.84993196e-02,
       -0.00000000e+00,  9.45120728e-03,  0.00000000e+00,  0.00000000e+00,
        0.00000000e+00,  0.00000000e+00, -1.97645200e-04,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        0.00000000e+00,  0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        2.24215195e-02, -0.00000000e+00,  0.00000000e+00, -6.40076705e-03,
       -0.00000000e+00,  2.01312931e-03, -5.27984329e-03,  0.00000000e+00,
        3.64106922e-03, -0.00000000e+00, -1.34094567e-03,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        1.57484001e-02, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00, -7.07074801e-03,  0.00000000e+00,  0.00000000e+00,
       -0.00000000e+00,  0.00000000e+00, -7.85380618e-03, -0.00000000e+00,
       -8.37582339e-03, -1.28827846e-02, -0.00000000e+00, -2.68891873e-03,
        0.00000000e+00,  4.66932027e-03,  0.00000000e+00,  2.46027533e-03,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -2.58161396e-02,
       -4.39301138e-02, -3.78446023e-02, -0.00000000e+00,  1.36415440e-03,
        0.00000000e+00,  3.24563031e-03, -0.00000000e+00,  3.19780600e-02,
       -3.62154800e-03, -3.97134955e-03, -0.00000000e+00, -0.00000000e+00,
        7.45002184e-03, -4.90475851e-03, -3.48629827e-03, -2.77453492e-04,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -5.69212437e-03,  2.75652775e-03, -0.00000000e+00,  6.86128403e-03,
       -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -2.73074823e-03,  0.00000000e+00, -4.82297695e-03,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -1.10429425e-03, -3.04929184e-02, -3.29164145e-02, -3.34439055e-02,
       -0.00000000e+00,  0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
       -3.89309875e-03,  2.36142689e-02, -0.00000000e+00, -0.00000000e+00,
        1.62587974e-02, -0.00000000e+00,  4.09044669e-03,  0.00000000e+00,
       -3.26021116e-03, -2.01877549e-03,  0.00000000e+00, -1.07349659e-02,
        6.30237225e-04, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
       -5.82144061e-04, -0.00000000e+00, -0.00000000e+00,  0.00000000e+00,
        0.00000000e+00, -1.87813737e-03,  1.23453328e-03,  0.00000000e+00,
        0.00000000e+00, -0.00000000e+00,  0.00000000e+00, -0.00000000e+00,
        5.52996840e-03,  0.00000000e+00, -5.38114145e-03,  0.00000000e+00,
       -0.00000000e+00, -1.28284412e-02,  1.23104872e-02,  3.16934018e-02])
In [100]:
#Lets calculate the mean squared error value
mse = mean_squared_error(y_train, lasso.predict(X_train))
print("The mean squared error value is ",mse)
# predicting the R2 value of train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
The mean squared error value is  0.0009707838136888879
The r2 value of train data is  0.9369076583225618
In [101]:
# Create a list of columns to drop from X_test
columns_to_drop = [col for col in X_test.columns if col not in X_train.columns]
print(len(columns_to_drop))
print(X_test.shape)
X_test_rfe = X_test.drop(columns=columns_to_drop)
lasso.fit(X_test_rfe,y_test)
y_test_pred = lasso.predict(X_test_rfe)
print(np.sqrt(mean_squared_error(y_test, y_test_pred)))
print(r2_score(y_test, y_test_pred))
0
(403, 224)
0.02966170849560128
0.9448092691620592

10. Conclusion :¶

The optimal value of LAMBDA we got in case of Ridge and Lasso is :

  • Ridge - 3.0
  • Lasso - 0.0001
In [ ]:
 
In [ ]:
 

Assignment Part - II ( Finding solutions for the subjective questions, Please refer the PDF file for the complete answers)¶

Question 1 : What will be the changes in the model if you choose double the value of alpha for both ridge and lasso?

In [102]:
#Lets find for Ridge first
alpha = 3.0 # Optimal value of alpha is 3
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
print("The output when alpha is 4: ")
mse = mean_squared_error(y_train, ridge.predict(X_train))
print("The mean squared error value of train data is ",mse)
mse = mean_squared_error(y_test, ridge.predict(X_test))
print("The mean squared error value of test data is ",mse)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
print()
The output when alpha is 4: 
The mean squared error value of train data is  0.0009472129960129189
The mean squared error value of test data is  0.0015129109188580304
The r2 value of train data is  0.9369076583225618
The r2 value of test data is  0.9448092691620592

In [103]:
alpha = 6.0 
ridge = Ridge(alpha=alpha)
ridge.fit(X_train, y_train)
ridge.coef_
print("The output when alpha is 6: ")
mse = mean_squared_error(y_test, ridge.predict(X_test))
print("The mean squared error value is ",mse)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
The output when alpha is 6: 
The mean squared error value is  0.0015295259041306838
The r2 value of train data is  0.9369076583225618
The r2 value of test data is  0.9448092691620592
In [104]:
#Let's create a ridge model with alpha  = 4.0
ridge_doubled = Ridge(alpha = 4.0)
ridge_doubled.fit(X_train,y_train)
y_train_ridge_pred_doubled = ridge_doubled.predict(X_train)
y_test_ridge_pred_doubled = ridge_doubled.predict(X_test)

ridge_coef_doubled_df = pd.DataFrame(ridge_doubled.coef_ , columns = ['Coefficient'], index =  X_train.columns)
print("Top predictor features for ridge when alpha is 6 are :\n")
print(ridge_coef_doubled_df.sort_values(by = 'Coefficient', ascending = False).head(10))
Top predictor features for ridge when alpha is 6 are :

                      Coefficient
Total_sqr_footage        0.134351
OverallQual              0.088469
TotalBsmtSF              0.079235
Neighborhood_StoneBr     0.058877
TotRmsAbvGrd             0.046711
Total_Bathrooms          0.045971
OverallCond              0.044430
GarageArea               0.044175
LotArea                  0.038149
Neighborhood_NoRidge     0.035671
In [105]:
#Now lets calculate for Lasso
alpha = 0.0001 #Optimal Value of alpha
lasso = Lasso(alpha=alpha)      
lasso.fit(X_train, y_train) 
lasso.coef_# mse
print("The output when alpha is 0.0001: ")
#Lets calculate the mean squared error value
mse = mean_squared_error(y_test, lasso.predict(X_test))
print("The mean squared error value is ",mse)
#predicting the R2 value on train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
#predicting the R2 value on test data
y_test_pred = lasso.predict(X_test)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
print()
alpha = 0.0002 #Optimal Value of alpha
lasso = Lasso(alpha=alpha)      
lasso.fit(X_train, y_train) 
lasso.coef_# mse
print("The output when alpha is 0.0002: ")
#Lets calculate the mean squared error value
mse = mean_squared_error(y_test, lasso.predict(X_test))
print("The mean squared error value is ",mse)
#predicting the R2 value on train data
y_train_pred = lasso.predict(X_train)
r2_train = metrics.r2_score(y_true=y_train, y_pred=y_train_pred)
print("The r2 value of train data is ",r2_train)
#predicting the R2 value on test data
y_test_pred = lasso.predict(X_test)
r2_test = metrics.r2_score(y_true=y_test, y_pred=y_test_pred)
print("The r2 value of test data is ",r2_test)
The output when alpha is 0.0001: 
The mean squared error value is  0.0014895373420533808
The r2 value of train data is  0.9369076583225618
The r2 value of test data is  0.9065616382631766

The output when alpha is 0.0002: 
The mean squared error value is  0.0014958564828966149
The r2 value of train data is  0.9313084900646047
The r2 value of test data is  0.9061652398975188
In [106]:
#Let's create a lasso model with alpha  = 0.0002 
lasso_doubled = Lasso(alpha=0.0002)
lasso_doubled.fit(X_train,y_train)
y_train_pred_doubled = lasso_doubled.predict(X_train)
y_test_pred_doubled = lasso_doubled.predict(X_test)
lasso_coef_doubled_df = pd.DataFrame(lasso_doubled.coef_ , columns = ['Coefficient'], index =  X_train.columns)
print("Top correlated features of Lasso when alpha is 0.0002 are:\n")
print(lasso_coef_doubled_df.sort_values(by = 'Coefficient', ascending = False).head(10))
Top correlated features of Lasso when alpha is 0.0002 are:

                       Coefficient
Total_sqr_footage         0.225682
OverallQual               0.133786
Neighborhood_StoneBr      0.065950
TotalBsmtSF               0.060540
OverallCond               0.058713
Neighborhood_NridgHt      0.044229
GarageArea                0.043697
Neighborhood_NoRidge      0.035138
SaleCondition_Partial     0.032983
BsmtExposure_Gd           0.030938
In [ ]:
 

Question 3 :After building the model, you realised that the five most important predictor variables in the lasso model are not available in the incoming data. You will now have to create another model excluding the five most important predictor variables. Which are the five most important predictor variables now?

In [107]:
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Index: 940 entries, 1146 to 858
Columns: 224 entries, MSSubClass to SaleCondition_Partial
dtypes: float64(25), int64(199)
memory usage: 1.6 MB
In [108]:
alpha = 0.0001 #Optimal Value of alpha
lasso = Lasso(alpha=alpha)      
lasso.fit(X_train, y_train) 
lasso_pred =  pd.DataFrame(lasso_doubled.coef_ , columns = ['Coefficient'], index =  X_train.columns)
print(lasso_pred.sort_values(by = 'Coefficient', ascending = False).head(10))
top_5_predictors = lasso_pred['Coefficient'].nlargest(5).index
X_train_dropped = X_train.drop(columns=top_5_predictors)
#print(X_train_dropped.columns)
print('r2 score',metrics.r2_score(y_true=y_train, y_pred= lasso.predict(X_train)))
                       Coefficient
Total_sqr_footage         0.225682
OverallQual               0.133786
Neighborhood_StoneBr      0.065950
TotalBsmtSF               0.060540
OverallCond               0.058713
Neighborhood_NridgHt      0.044229
GarageArea                0.043697
Neighborhood_NoRidge      0.035138
SaleCondition_Partial     0.032983
BsmtExposure_Gd           0.030938
r2 score 0.9369076583225618
In [109]:
#Lasso prediction by dropping top5 predictors
lasso.fit(X_train_dropped, y_train) 
lasso_pred =  pd.DataFrame(lasso.coef_ , columns = ['Coefficient'], index =  X_train_dropped.columns)
y_train_pred = lasso.predict(X_train_dropped)
In [110]:
print('Applying Lasso after dropping 5 predictor variables')
print(lasso_pred.sort_values(by = 'Coefficient', ascending = False).head(10))
print('r2 score',metrics.r2_score(y_true=y_train, y_pred=y_train_pred))
Applying Lasso after dropping 5 predictor variables
                       Coefficient
TotRmsAbvGrd              0.129455
Total_Bathrooms           0.112029
GarageArea                0.105183
Fireplaces                0.051469
LotArea                   0.048502
Street_Pave               0.047245
Neighborhood_NoRidge      0.044297
BsmtExposure_Gd           0.042970
BsmtUnfSF                 0.040704
SaleCondition_Partial     0.033489
r2 score 0.9050034157050509
In [ ]:
 
In [ ]: